Abstract-The 2-D DCT is widely used in signal processing. In practice, however, the transform applies a 1-D DCT to the rows and then to the columns of the 2-D data in succession, which limits further improvement of the transform speed. To overcome this drawback, this paper proposes a parallel computing method. First, several new matrix operations and a new transform matrix are defined. Then the 2-D SDCT (Submatrix Discrete Cosine Transform) is carried out on the signal as a whole, using the new transform matrix and the new matrix operations. Finally, the parallel computation of the 2-D DCT is analyzed based on the characteristics of the 2-D SDCT. Theoretical analysis shows that the computation time of the proposed method is equivalent to that of one multiplication and a few additions, and that the transform speed is notably improved relative to other fast algorithms.

Keywords-2-D SDCT, Fast Algorithm, Parallel Computing, 
Serial Computing. 

I. Introduction 

Under the mean square error (MSE) criterion, the Karhunen-Loève (K-L) transform is the optimal orthogonal transform for a first-order Markov process [ ]. However, the K-L transform requires statistical information about the sampled data and has no fast algorithm. The DCT, by contrast, does not depend on the signal itself, performs very close to the K-L transform, and is well suited to hardware implementation. The DCT is therefore widely used in digital signal processing, and in particular it has become the core of still and moving image compression standards such as JPEG, MPEG and H.26x.

However, computing the DCT directly is expensive: it requires a large number of multiplications and additions, which markedly reduces the transform efficiency. If the DCT is applied directly without improvement, the amount of computation is very large and real-time processing is hard to achieve.

Exploiting the symmetry of the transform matrix, Chen [2] first proposed a fast DCT algorithm based on sparse matrix decomposition in 1977. Many improved fast algorithms have appeared since [3~10]; among the most influential are the Vetterli algorithm [8] and the Feig algorithm [9]. Most fast 2-D DCT algorithms perform the transform as two successive passes of 1-D DCTs and aim to reduce the number of additions and, especially, multiplications. Although the operation counts can be reduced to some extent, carrying out the two 1-D passes one after the other still limits further improvement of the transform speed.

Software and hardware technologies for parallel computing are now increasingly mature, and parallel computing is becoming a main theme of scientific and engineering computation in the 21st century, so research on a fast 2-D DCT based on parallel computing is all the more significant. This paper therefore proposes a parallel computing scheme, the 2-D SDCT (Submatrix Discrete Cosine Transform), which takes the 2-D signal as a whole on input and produces the transform result as a whole on output in order to increase the computing speed.

The paper is organized as follows. Section II describes the matrix notation and the matrix operations. Section III describes the traditional computation of the 2-D DCT. Section IV describes the 2-D SDCT algorithm. Section V discusses the parallel computing of the 2-D SDCT in detail. Section VI analyzes the performance of the algorithm. Section VII gives some conclusions.

II. Matrix Expression and Operation 

For convenience of description, the matrix notation and some matrix operations are defined first [11, 12].

A. Matrix Expression 

For convenience, a matrix is denoted either by a bold letter, A, B, etc., or by the abbreviation $[a_{ij}]$, $[b_{ij}]$, etc., where $i$ and $j$ are the row and column of an element in the 2-D matrix. When the total number of rows, $R$, and the total number of columns, $C$, need to be stated, $A_{R\times C}$ or $[a_{ij}]_{R\times C}$ is used. In addition, a larger matrix can be divided into sub-matrices (or sub-blocks). A matrix whose elements are sub-matrices is called a partitioned matrix.

A sub-matrix is written $A_{rc,R\times C}$, where $R\times C$ is the sub-matrix size and $r$ and $c$ give the position of the sub-matrix in the partitioned matrix. An element of a sub-matrix is written $a_{ij,rc}$, where $r$ and $c$ give the position of the sub-matrix in the partitioned matrix and $i$ and $j$ give the position of the element within the sub-matrix. For example, a 4×4 matrix divided into 4 sub-matrices is written as equation (1).

$$
A_{4\times 4} =
\begin{bmatrix}
A_{11,2\times 2} & A_{12,2\times 2} \\
A_{21,2\times 2} & A_{22,2\times 2}
\end{bmatrix}
\tag{1}
$$



B. Matrix Operation Method 

Definition 1: For a matrix $A=(a_{kl,rc})$, where $1\le r\le R$, $1\le c\le C$, $1\le k\le K$ and $1\le l\le L$, the "sub-matrix element position" operation on the matrix is written $A^{\mathrm{L}}$ and defined as equation (2).

$$
A^{\mathrm{L}} = (a_{rc,kl})
\tag{2}
$$

The matrix $A^{\mathrm{L}}$ is called the "position matrix" of matrix $A$: the element at position $(k,l)$ of sub-matrix $(r,c)$ in $A$ becomes the element at position $(r,c)$ of sub-matrix $(k,l)$ in $A^{\mathrm{L}}$. For example, a 4×4 matrix $A$ can be divided into 4 sub-matrices of the same size, and the "sub-matrix element position" operation on it is as follows:

$$
[A_{4\times 4}]^{\mathrm{L}} =
\left[\begin{array}{cc|cc}
a_{11,11} & a_{12,11} & a_{11,12} & a_{12,12} \\
a_{21,11} & a_{22,11} & a_{21,12} & a_{22,12} \\ \hline
a_{11,21} & a_{12,21} & a_{11,22} & a_{12,22} \\
a_{21,21} & a_{22,21} & a_{21,22} & a_{22,22}
\end{array}\right]^{\mathrm{L}}
=
\left[\begin{array}{cc|cc}
a_{11,11} & a_{11,12} & a_{12,11} & a_{12,12} \\
a_{11,21} & a_{11,22} & a_{12,21} & a_{12,22} \\ \hline
a_{21,11} & a_{21,12} & a_{22,11} & a_{22,12} \\
a_{21,21} & a_{21,22} & a_{22,21} & a_{22,22}
\end{array}\right]
$$

Obviously, the operation works on sub-matrices and requires all sub-matrices to be the same size.

Definition 2: For two matrices $A=(a_{rc})$ and $B=(b_{rc})$, where $1\le r\le R$ and $1\le c\le C$, the "matrix element product" operation of the two matrices is written $A\vee B$ and defined as equation (3). The result is a 1×1 matrix, that is, a constant.

$$
A\vee B = \sum_{r=1}^{R}\sum_{c=1}^{C} a_{rc}\,b_{rc}
\tag{3}
$$
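Definitions 1 and 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: matrices are plain lists of rows, indices are 0-based rather than 1-based, and the function names `position_matrix` and `element_product` are hypothetical names chosen here, not taken from the paper.

```python
def position_matrix(a, rows, cols):
    """Sub-matrix element position operation (Definition 1).

    `a` is a list of rows, partitioned into sub-matrices of size rows x cols.
    The element at (k, l) inside sub-matrix (r, c) is moved to position
    (r, c) inside sub-matrix (k, l) of the result.
    """
    block_rows = len(a) // rows       # number of sub-matrices vertically
    block_cols = len(a[0]) // cols    # number of sub-matrices horizontally
    out = [[None] * len(a[0]) for _ in range(len(a))]
    for r in range(block_rows):
        for c in range(block_cols):
            for k in range(rows):
                for l in range(cols):
                    out[k * block_rows + r][l * block_cols + c] = \
                        a[r * rows + k][c * cols + l]
    return out


def element_product(a, b):
    """Matrix element product (Definition 2): sum of a[r][c] * b[r][c]."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


a = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
print(position_matrix(a, 2, 2))
print(element_product([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

For a 4×4 matrix with 2×2 sub-matrices, applying `position_matrix` twice returns the original matrix, which matches the symmetric roles of the two index pairs in equation (2).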



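Definition 3: For two matrices $A=(a_{rc})$ and $B=(b_{rc,kl})$, where $1\le r\le R$, $1\le c\le C$, $1\le k\le K$ and $1\le l\le L$, the "sub-matrix product" operation of the two matrices is written $A\triangleright B$, and its result is a $K\times L$ matrix. The "sub-matrix product" operation is defined as equation (4).

$$
A_{R\times C}\triangleright B_{(KR)\times(LC)}
= A_{R\times C}\triangleright
\begin{bmatrix}
B_{11,R\times C} & \cdots & B_{1L,R\times C} \\
\vdots & & \vdots \\
B_{K1,R\times C} & \cdots & B_{KL,R\times C}
\end{bmatrix}
=
\begin{bmatrix}
A_{R\times C}\triangleright B_{11,R\times C} & \cdots & A_{R\times C}\triangleright B_{1L,R\times C} \\
\vdots & & \vdots \\
A_{R\times C}\triangleright B_{K1,R\times C} & \cdots & A_{R\times C}\triangleright B_{KL,R\times C}
\end{bmatrix}
=
\left[\sum_{r=1}^{R}\sum_{c=1}^{C} a_{rc}\,b_{rc,kl}\right]_{K\times L}
\tag{4}
$$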


The "sub-matrix product" operation satisfies the following rule (suppose A and B are both R×C matrices, and C is a (KR)×(LC) matrix):

$(A+B)\triangleright C = A\triangleright C + B\triangleright C$

Note: (1) In this operation, matrix B (the larger matrix) must be partitioned according to the size of matrix A (the smaller matrix), that is, matrix A and the sub-matrices of matrix B are the same size.

(2) When K and L are both equal to 1, matrix B consists of a single sub-matrix and the "sub-matrix product" operation reduces to the "matrix element product" operation. So the "matrix element product" operation is a special case of the "sub-matrix product" operation.
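The "sub-matrix product" of Definition 3 can be sketched as follows. This is a minimal illustration, assuming matrices stored as lists of rows with 0-based indices; `submatrix_product` and `element_product` are hypothetical helper names, not names from the paper.

```python
def element_product(a, b):
    """Matrix element product (Definition 2)."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def submatrix_product(a, b):
    """Sub-matrix product (Definition 3).

    `a` is R x C, `b` is (K*R) x (L*C). Entry (k, l) of the K x L result is
    the element product of `a` with the R x C sub-matrix (k, l) of `b`.
    """
    R, C = len(a), len(a[0])
    K, L = len(b) // R, len(b[0]) // C
    result = []
    for k in range(K):
        row = []
        for l in range(L):
            block = [b[k * R + r][l * C:(l + 1) * C] for r in range(R)]
            row.append(element_product(a, block))
        result.append(row)
    return result


a = [[1, 2], [3, 4]]
b = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
print(submatrix_product(a, b))
```

When `b` has the same size as `a` (K = L = 1), the result is a 1×1 matrix holding the element product, which is the special case stated in note (2).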

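Definition 4: For two matrices $A=(a_{rc})$ and $B=(b_{kl,rc})$, where $1\le r\le R$, $1\le c\le C$, $1\le k\le K$ and $1\le l\le L$, the "matrix superimposition product" operation of the two matrices is written $A\triangleleft B$ and defined as equation (5). The result is a $K\times L$ matrix.

$$
A_{R\times C}\triangleleft B_{(KR)\times(LC)}
=
\begin{bmatrix}
a_{11} & \cdots & a_{1C} \\
\vdots & & \vdots \\
a_{R1} & \cdots & a_{RC}
\end{bmatrix}
\triangleleft
\begin{bmatrix}
B_{11,K\times L} & \cdots & B_{1C,K\times L} \\
\vdots & & \vdots \\
B_{R1,K\times L} & \cdots & B_{RC,K\times L}
\end{bmatrix}
=
\sum_{r=1}^{R}\sum_{c=1}^{C} a_{rc}\,B_{rc,K\times L}
\tag{5}
$$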



The "matrix superimposition product" operation satisfies the following rule (suppose A and B are both R×C matrices, and C is a (KR)×(LC) matrix):

$(A+B)\triangleleft C = A\triangleleft C + B\triangleleft C$

Note: (1) In the "matrix superimposition product" operation, matrix B (the larger matrix) must be partitioned according to the number of elements of matrix A (the smaller matrix), that is, the number of sub-matrices of matrix B equals the number of elements of matrix A.

(2) When K and L are both equal to 1, every sub-matrix of matrix B is a single element and the "matrix superimposition product" operation reduces to the "matrix element product" operation. So the "matrix element product" operation is also a special case of the "matrix superimposition product" operation.
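The "matrix superimposition product" of Definition 4 can be sketched in the same style. Again this is only an illustration: list-of-rows matrices, 0-based indices, and the hypothetical function name `superimposition_product` are assumptions made here.

```python
def superimposition_product(a, b):
    """Matrix superimposition product (Definition 4).

    `a` is R x C, `b` is (K*R) x (L*C), viewed as R x C sub-matrices of
    size K x L. The result is the K x L weighted sum of those sub-matrices,
    with a[r][c] as the weight of sub-matrix (r, c).
    """
    R, C = len(a), len(a[0])
    K, L = len(b) // R, len(b[0]) // C
    out = [[0] * L for _ in range(K)]
    for r in range(R):
        for c in range(C):
            for k in range(K):
                for l in range(L):
                    out[k][l] += a[r][c] * b[r * K + k][c * L + l]
    return out


a = [[1, 2], [3, 4]]
b = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
print(superimposition_product(a, b))
```

With K = L = 1 the sub-matrices are single elements and the result is the 1×1 element product, matching note (2).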

III. 2-D DCT Algorithm 

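Let the 2-D signal be $f(r,c)$, $r=0,1,\ldots,R-1$, $c=0,1,\ldots,C-1$. The 2-D DCT and the 2-D IDCT are described by equation (6) and equation (7) respectively.

$$
F(u,v)=\sum_{r=0}^{R-1}\sum_{c=0}^{C-1} f(r,c)\,
\sqrt{\frac{2}{R}}\,c(u)\cos\left[\frac{\pi(2r+1)u}{2R}\right]
\sqrt{\frac{2}{C}}\,c(v)\cos\left[\frac{\pi(2c+1)v}{2C}\right]
\tag{6}
$$

$$
f(r,c)=\sum_{u=0}^{R-1}\sum_{v=0}^{C-1} F(u,v)\,
\sqrt{\frac{2}{R}}\,c(u)\cos\left[\frac{\pi(2r+1)u}{2R}\right]
\sqrt{\frac{2}{C}}\,c(v)\cos\left[\frac{\pi(2c+1)v}{2C}\right]
\tag{7}
$$

Where,

$$
c(u)=\begin{cases} 1/\sqrt{2}, & u=0 \\ 1, & \text{otherwise} \end{cases}
\qquad
c(v)=\begin{cases} 1/\sqrt{2}, & v=0 \\ 1, & \text{otherwise} \end{cases}
$$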

Because the 2-D DCT is separable, it can be computed by applying the 1-D DCT to the rows and then to the columns of the 2-D signal in succession, as shown in equation (8) and equation (9).

$$
F(u,v)=\sqrt{\frac{2}{R}}\,c(u)\sum_{r=0}^{R-1}\left\{\left[\sqrt{\frac{2}{C}}\,c(v)\sum_{c=0}^{C-1} f(r,c)\cos\left[\frac{\pi(2c+1)v}{2C}\right]\right]\cos\left[\frac{\pi(2r+1)u}{2R}\right]\right\}
\tag{8}
$$

$$
f(r,c)=\sum_{u=0}^{R-1}\sqrt{\frac{2}{R}}\,c(u)\left\{\left[\sum_{v=0}^{C-1}\sqrt{\frac{2}{C}}\,c(v)\,F(u,v)\cos\left[\frac{\pi(2c+1)v}{2C}\right]\right]\cos\left[\frac{\pi(2r+1)u}{2R}\right]\right\}
\tag{9}
$$

Equations (8) and (9) are called the conventional serial algorithm (CSA), and this separable form is the core idea of serial computation of the 2-D DCT.

From equations (8) and (9), we can see that for an N×N signal the 2-D DCT requires $2N^3$ multiplications and $2N^2(N-1)$ additions if it is applied directly without improvement. For an 8×8 signal, the total amounts to 1024 multiplications and 896 additions.
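The operation counts above follow from simple bookkeeping, sketched below. The sketch assumes that each length-N 1-D DCT is evaluated directly (N multiplications and N-1 additions per output sample) and that N row transforms are followed by N column transforms; the function name `direct_row_column_cost` is hypothetical.

```python
def direct_row_column_cost(n):
    """Operation counts for an n x n 2-D DCT done as 2n direct 1-D DCTs."""
    mults_per_1d = n * n          # n outputs, n multiplications each
    adds_per_1d = n * (n - 1)     # n outputs, n - 1 additions each
    transforms = 2 * n            # n rows, then n columns
    return transforms * mults_per_1d, transforms * adds_per_1d


print(direct_row_column_cost(8))  # (1024, 896)
```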



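Set

$$
T_{R\times R}=
\begin{bmatrix}
Q_{1,1\times R} \\
Q_{2,1\times R} \\
\vdots \\
Q_{R,1\times R}
\end{bmatrix},
\qquad
P_{C\times C}=
\begin{bmatrix}
Q_{1,1\times C} \\
Q_{2,1\times C} \\
\vdots \\
Q_{C,1\times C}
\end{bmatrix}
\tag{10}
$$

Where $Q_{(u+1),1\times N}=[\,Q(0,u)\ \ Q(1,u)\ \cdots\ Q(N-1,u)\,]$, with $N=R$ for T and $N=C$ for P, and

$$
Q(r,u)=\sqrt{\frac{2}{N}}\,c(u)\cos\left[\frac{\pi(2r+1)u}{2N}\right]
$$

$Q(r,u)$ is called the DCT transform kernel, $Q_{(u+1),1\times N}$ is called a basis signal, and matrix T and matrix P are called the row transform matrix and the column transform matrix respectively. If R and C are both equal to N, matrix T and matrix P are identical and are both called the transform matrix.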

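Based on the transform matrices, the 2-D DCT and the 2-D IDCT can be written as follows, where the superscript $\mathsf{T}$ denotes the transpose.

$$
F_{R\times C}=T_{R\times R}\,f_{R\times C}\,P_{C\times C}^{\mathsf{T}}
\tag{11}
$$

$$
f_{R\times C}=T_{R\times R}^{\mathsf{T}}\,F_{R\times C}\,P_{C\times C}
\tag{12}
$$

The forms in equations (11) and (12) also compute the 2-D DCT point by point, and are likewise called serial computing.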
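The matrix form of the serial 2-D DCT can be checked numerically with a short sketch. It assumes that each row of a transform matrix holds one basis signal, so that the forward transform multiplies by the row transform matrix on the left and by the transposed column transform matrix on the right; the helper names (`dct_matrix`, `matmul`, `transpose`, `dct2`, `idct2`) are hypothetical and only the standard library is used.

```python
import math


def dct_matrix(n):
    """n x n transform matrix; entry [u][r] is the kernel Q(r, u)."""
    def c(u):
        return 1 / math.sqrt(2) if u == 0 else 1.0
    return [[math.sqrt(2 / n) * c(u) * math.cos(math.pi * (2 * r + 1) * u / (2 * n))
             for r in range(n)] for u in range(n)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def dct2(f):
    """Forward 2-D DCT: row transform, signal, transposed column transform."""
    t, p = dct_matrix(len(f)), dct_matrix(len(f[0]))
    return matmul(matmul(t, f), transpose(p))


def idct2(F):
    """Inverse 2-D DCT: transposed row transform, coefficients, column transform."""
    t, p = dct_matrix(len(F)), dct_matrix(len(F[0]))
    return matmul(matmul(transpose(t), F), p)


f = [[r * 4 + c for c in range(4)] for r in range(3)]   # a 3 x 4 test signal
restored = idct2(dct2(f))
print(max(abs(restored[r][c] - f[r][c]) for r in range(3) for c in range(4)))
```

The round trip reproduces the input up to floating-point error because the transform matrices are orthogonal.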

IV. 2-D SDCT Algorithm

The main idea of the 2-D SDCT is not to divide the 2-D data into rows and columns, but to operate on the 2-D signal as a whole in order to increase the computing speed.

A. Definition of the Transform Basic Matrix

To develop the 2-D SDCT, a new transform matrix, T, is defined as equation (13).



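$$
T_{R^2\times C^2}=
\begin{bmatrix}
Q_{11,R\times C} & Q_{12,R\times C} & \cdots & Q_{1C,R\times C} \\
Q_{21,R\times C} & Q_{22,R\times C} & \cdots & Q_{2C,R\times C} \\
\vdots & \vdots & & \vdots \\
Q_{R1,R\times C} & Q_{R2,R\times C} & \cdots & Q_{RC,R\times C}
\end{bmatrix}
\tag{13}
$$

Where,

$$
Q_{(r+1)(c+1),R\times C}=
\begin{bmatrix}
Q(0,0,r,c) & Q(0,1,r,c) & \cdots & Q(0,C-1,r,c) \\
Q(1,0,r,c) & Q(1,1,r,c) & \cdots & Q(1,C-1,r,c) \\
\vdots & \vdots & & \vdots \\
Q(R-1,0,r,c) & Q(R-1,1,r,c) & \cdots & Q(R-1,C-1,r,c)
\end{bmatrix}
$$

$$
Q(u,v,r,c)=\sqrt{\frac{2}{R}}\,c(u)\cos\left[\frac{\pi(2r+1)u}{2R}\right]
\sqrt{\frac{2}{C}}\,c(v)\cos\left[\frac{\pi(2c+1)v}{2C}\right]
\tag{14}
$$

$Q(u,v,r,c)$ is called the 2-D SDCT transform kernel, $Q_{(r+1)(c+1),R\times C}$ is called a basis signal, and the matrix T is called the transform basic matrix. For an 8×8 2-D signal, the transform basic matrix is shown in Fig. 1.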



Fig. 1 Transform basic matrix of 8×8 signal

Note: The range of the transform basic matrix is rescaled to [0, 255] to improve the visual effect.
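As an illustration of how the transform basic matrix can be assembled, the following sketch builds it as a list of rows. It assumes 0-based indices and places the kernel value for (u, v, r, c) at row r·R + u and column c·C + v, so that sub-matrix (r, c) is the basis signal for signal position (r, c); the function names `kernel` and `basic_matrix` are hypothetical.

```python
import math


def kernel(u, v, r, c, R, C):
    """2-D SDCT transform kernel Q(u, v, r, c) for an R x C signal."""
    cu = 1 / math.sqrt(2) if u == 0 else 1.0
    cv = 1 / math.sqrt(2) if v == 0 else 1.0
    return (math.sqrt(2 / R) * cu * math.cos(math.pi * (2 * r + 1) * u / (2 * R))
            * math.sqrt(2 / C) * cv * math.cos(math.pi * (2 * c + 1) * v / (2 * C)))


def basic_matrix(R, C):
    """Transform basic matrix of size R^2 x C^2 as a list of rows."""
    return [[kernel(u, v, r, c, R, C) for c in range(C) for v in range(C)]
            for r in range(R) for u in range(R)]


T = basic_matrix(8, 8)
print(len(T), len(T[0]))   # 64 64
print(T[0][0])             # top-left element of every sub-matrix is about 1/8
```

For the 8×8 case the matrix has 64×64 entries, and the (0, 0) element of every sub-matrix equals 1/8 up to rounding, since both cosines are 1 when u = v = 0.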

B. 2-D SDCT Expression 

According to the definition of "sub-matrix product" 
operation and "matrix superimposition product" operation, 
equation (6) and equation (7) can be rewritten as equation (15) 
and equation (16). 



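$$
F_{R\times C}=[F(u,v)]_{R\times C}
=\left[\sum_{r=0}^{R-1}\sum_{c=0}^{C-1} f(r,c)\,Q(u,v,r,c)\right]_{R\times C}
=\sum_{r=0}^{R-1}\sum_{c=0}^{C-1} f(r,c)\,Q_{(r+1)(c+1),R\times C}
\tag{15}
$$

$$
[f(r,c)]_{1\times 1}
=\sum_{u=0}^{R-1}\sum_{v=0}^{C-1} F(u,v)\,Q(u,v,r,c)
=F_{R\times C}\triangleright Q_{(r+1)(c+1),R\times C}
\tag{16}
$$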



According to transform basic matrix, the definition of 
"sub-matrix product" operation and "matrix superimposition 
product" operation, equation (15) and equation (16) can be 
abbreviated further as equation (17) and equation (18). 



$$
F_{R\times C}=f_{R\times C}\triangleleft T_{R^2\times C^2}
\tag{17}
$$

$$
f_{R\times C}=F_{R\times C}\triangleright T_{R^2\times C^2}
\tag{18}
$$



Equations (17) and (18) are the SDCT form of the 2-D DCT. Equation (17) shows that the 2-D DCT is obtained by multiplying each element of the signal f by the corresponding sub-matrix of the transform basic matrix T and then superimposing (summing) all the resulting sub-matrices. In other words, the transform result F is a weighted sum of the basis signals, with the elements of the 2-D signal f as the weights.

Equation (18) shows that the "sub-matrix product" of the 2-D transform result F with a given sub-matrix (basis signal) yields the inverse transform value at the position corresponding to that sub-matrix. Performing one "sub-matrix product" of F with each sub-matrix of the transform basic matrix therefore gives R×C values at the corresponding positions, that is, the inverse transform result f. The 2-D DCT computation based on the transform basic matrix is thus simple and easy to understand.
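The two readings above can be sketched directly: the forward transform as a weighted sum of basis signals, and the inverse as one sum of element-wise products per output position. The sketch assumes 0-based indices and evaluates the kernel on the fly instead of storing the transform basic matrix; the function names `kernel`, `sdct` and `isdct` are hypothetical.

```python
import math


def kernel(u, v, r, c, R, C):
    """2-D SDCT transform kernel Q(u, v, r, c) for an R x C signal."""
    cu = 1 / math.sqrt(2) if u == 0 else 1.0
    cv = 1 / math.sqrt(2) if v == 0 else 1.0
    return (math.sqrt(2 / R) * cu * math.cos(math.pi * (2 * r + 1) * u / (2 * R))
            * math.sqrt(2 / C) * cv * math.cos(math.pi * (2 * c + 1) * v / (2 * C)))


def sdct(f):
    """Forward transform: superimpose basis signals weighted by f(r, c)."""
    R, C = len(f), len(f[0])
    F = [[0.0] * C for _ in range(R)]
    for r in range(R):
        for c in range(C):
            for u in range(R):
                for v in range(C):
                    F[u][v] += f[r][c] * kernel(u, v, r, c, R, C)
    return F


def isdct(F):
    """Inverse transform: element product of F with each basis signal."""
    R, C = len(F), len(F[0])
    return [[sum(F[u][v] * kernel(u, v, r, c, R, C)
                 for u in range(R) for v in range(C))
             for c in range(C)] for r in range(R)]


f = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
print(isdct(sdct(f)))   # close to f, up to rounding
```

Each inner product in `sdct` and each output value in `isdct` is independent of the others, which is the property the parallel scheme relies on.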

For example, for an 8×8 2-D signal, the DCT process can be illustrated as in Fig. 2 [13, 14].

Fig. 2 The operation sketch map of 2-D SDCT

In addition, equation (17) and equation (18) can be rewritten as follows using the "sub-matrix element position" operation.

$$
F_{R\times C}=f_{R\times C}\triangleright T^{\mathrm{L}}_{R^2\times C^2}
\tag{19}
$$

$$
f_{R\times C}=F_{R\times C}\triangleleft T^{\mathrm{L}}_{R^2\times C^2}
\tag{20}
$$

Similarly, the operation corresponding to equation (19) and equation (20) is sketched in Fig. 3.
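The equivalence between the two pairs of expressions can be checked numerically. The sketch below is an illustration under assumptions: 0-based indices, list-of-rows matrices, and hypothetical helper names that mirror the four operations of Section II. It verifies that the forward transform obtained by superimposition with the transform basic matrix equals the one obtained by the sub-matrix product with its position matrix.

```python
import math


def kernel(u, v, r, c, R, C):
    cu = 1 / math.sqrt(2) if u == 0 else 1.0
    cv = 1 / math.sqrt(2) if v == 0 else 1.0
    return (math.sqrt(2 / R) * cu * math.cos(math.pi * (2 * r + 1) * u / (2 * R))
            * math.sqrt(2 / C) * cv * math.cos(math.pi * (2 * c + 1) * v / (2 * C)))


def basic_matrix(R, C):
    # Sub-matrix (r, c) holds Q(u, v, r, c) at position (u, v).
    return [[kernel(u, v, r, c, R, C) for c in range(C) for v in range(C)]
            for r in range(R) for u in range(R)]


def position_matrix(a, rows, cols):
    # Swap the roles of sub-matrix position and in-sub-matrix position.
    br, bc = len(a) // rows, len(a[0]) // cols
    out = [[0.0] * len(a[0]) for _ in range(len(a))]
    for r in range(br):
        for c in range(bc):
            for k in range(rows):
                for l in range(cols):
                    out[k * br + r][l * bc + c] = a[r * rows + k][c * cols + l]
    return out


def submatrix_product(a, b):
    R, C = len(a), len(a[0])
    K, L = len(b) // R, len(b[0]) // C
    return [[sum(a[r][c] * b[k * R + r][l * C + c]
                 for r in range(R) for c in range(C))
             for l in range(L)] for k in range(K)]


def superimposition_product(a, b):
    R, C = len(a), len(a[0])
    K, L = len(b) // R, len(b[0]) // C
    return [[sum(a[r][c] * b[r * K + k][c * L + l]
                 for r in range(R) for c in range(C))
             for l in range(L)] for k in range(K)]


def max_abs_diff(x, y):
    return max(abs(p - q) for rx, ry in zip(x, y) for p, q in zip(rx, ry))


f = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
T = basic_matrix(2, 3)
TL = position_matrix(T, 2, 3)
F_a = superimposition_product(f, T)    # form of equation (17)
F_b = submatrix_product(f, TL)         # form of equation (19)
print(max_abs_diff(F_a, F_b))
```

The same helpers also recover the signal from the coefficients with either the sub-matrix product on the basic matrix or the superimposition product on its position matrix.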

V. Parallel Computing of 2-D SDCT

Taking the computation for an 8×8 signal as an example, the parallel computing of the 2-D DCT is explained as follows.

A. Parallel Computing of 2-D DCT

The 2-D DCT is obtained by multiplying each element of f by the corresponding sub-matrix of T and then superimposing all the resulting sub-matrices. In this process, the products of any one element of f with the elements of its sub-matrix of T can be computed simultaneously.







Fig. 3 The operation sketch map of 2-D SDCT 

Taking the product of f(0,0) and $Q_{11}$ as an example, the operation is independent of every other sub-matrix, as shown in Fig. 4(a). We call this mode of operation the internal parallel processing of the 2-D DCT, and denote it in abbreviated form as in Fig. 4(b).



[Figure: (a) f(0,0) multiplied simultaneously by the kernel values Q(0,0,0,0), Q(0,1,0,0), ..., Q(7,7,0,0); (b) abbreviated form, f(0,0) multiplied by Q11]

Fig. 4 The sketch of 2-D DCT internal parallel processing

In addition, the products of the 8×8 elements of f with the 64 elements of their respective sub-matrices can all be computed in parallel, as shown in Fig. 5, which yields 8×8 sub-matrices, each scaled by the corresponding element of f. Since all the multiplications are performed at the same time, the actual time consumed by multiplication is equivalent to the time of one multiplication.

The 8×8 scaled sub-matrices then only need to be added together to give the final transform result F. For this superimposition, a staged parallel accumulation strategy is adopted, as shown in Fig. 5. The 64 product sub-matrices are added in pairs to give 32 addends at step 1, and these are added in pairs to give 16 addends at step 2; likewise there are 8 addends after step 3, 4 after step 4, 2 after step 5 and 1 after step 6. The transform result F is therefore obtained after 6 levels of accumulation, and the actual time consumed by addition is equivalent to the time of 6 additions. We call this mode of operation the external parallel processing of the 2-D DCT.

Fig. 5 The sketch of 2-D DCT external parallel processing

Overall, with the internal and the external parallel processing, the 2-D DCT requires one parallel multiplication step and 6 levels of parallel addition, so the actual time consumed is equivalent to the time of 1 multiplication and 6 additions.
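The staged accumulation can be sketched as a pairwise (tree) reduction that also counts its levels. The code below runs sequentially and only models the structure: within one level all additions are independent and could run at the same time on parallel hardware. The function name `tree_accumulate` and the `add` parameter are hypothetical choices made for this illustration.

```python
def tree_accumulate(terms, add):
    """Pairwise accumulation; returns (total, number_of_levels).

    All additions inside one level are independent of each other, so each
    level would cost one addition time on sufficiently parallel hardware.
    """
    terms = list(terms)
    levels = 0
    while len(terms) > 1:
        paired = [add(terms[i], terms[i + 1]) for i in range(0, len(terms) - 1, 2)]
        if len(terms) % 2:            # an unpaired term is carried to the next level
            paired.append(terms[-1])
        terms = paired
        levels += 1
    return terms[0], levels


def add_matrices(x, y):
    """Element-wise sum of two equally sized matrices (lists of rows)."""
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


# 64 scalar addends, as for the products of an 8 x 8 signal.
print(tree_accumulate(range(64), lambda x, y: x + y))   # (2016, 6)

# The same scheme applied to 64 small product sub-matrices.
blocks = [[[i, i], [i, i]] for i in range(64)]
total, levels = tree_accumulate(blocks, add_matrices)
print(levels)   # 6
```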

B. Parallel Computing of 2-D IDCT 

Taking the 8×8 signal as an example again, the parallel computing of the 2-D IDCT is explained as follows. Fig. 3 shows that the transform matrix has 8×8 sub-matrices, and that the "sub-matrix product" of F with each sub-matrix generates the pixel value at the corresponding position. The "sub-matrix products" of the 2-D transform result F with the 8×8 sub-matrices therefore generate 8×8 pixel values, which completes the IDCT of the 2-D signal.

In this process, the "sub-matrix products" of F with the 8×8 sub-matrices can be computed simultaneously, and the 8×8 sub-matrices do not interfere with one another, as shown in Fig. 6.



[Figure: F combined in parallel with each sub-matrix Q11,8×8, Q12,8×8, ..., giving the pixel values f(0,0), f(0,1), ..., f(7,0), ..., f(7,7)]

Fig. 6 The sketch of 2-D IDCT external parallel processing

Taking the "sub-matrix product" of F with the sub-matrix $Q_{11,8\times 8}$ as an example, the pixel value f(0,0) is obtained from F and $Q_{11,8\times 8}$ alone and is independent of every other sub-matrix. We call this mode of operation the external parallel processing of the 2-D IDCT.

In addition, the products of the 64 elements of F with the 64 elements of each sub-matrix can also be computed in parallel. Taking the product of F with the sub-matrix $Q_{11,8\times 8}$ as an example, the calculation is shown in Fig. 7 and generates 64 products.



[Figure: the 64 products F(u,v) × Q(u,v,0,0), from F(0,0) × Q(0,0,0,0) to F(7,7) × Q(7,7,0,0), computed in parallel and accumulated in stages to give f(0,0)]

Fig. 7 The sketch of 2-D IDCT internal parallel processing

The 64 products then only need to be added together to give the final result f(0,0). For this, the staged parallel accumulation strategy is again adopted, as shown in Fig. 7. The 64 products are added in pairs to give 32 addends at step 1, and these are added in pairs to give 16 addends at step 2; likewise there are 8 addends after step 3, 4 after step 4, 2 after step 5 and 1 after step 6. The result f(0,0) is therefore obtained after 6 levels of accumulation. The other 63 inverse transform values are obtained by the same parallel computation as f(0,0). We call this mode of operation the internal parallel processing of the 2-D IDCT.

Overall, with the internal and the external parallel processing, the 2-D IDCT requires one parallel multiplication step and six levels of parallel addition, so the actual time consumed is equivalent to the time of 1 multiplication and 6 additions.
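The independence of the output pixels can be illustrated by dispatching each pixel as a separate task. The sketch below uses `concurrent.futures.ThreadPoolExecutor` from the Python standard library purely to show the task structure; it is not a performance claim, since CPython threads do not speed up this kind of arithmetic, and the function names `kernel`, `pixel` and `idct_parallel` are hypothetical.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def kernel(u, v, r, c, R, C):
    cu = 1 / math.sqrt(2) if u == 0 else 1.0
    cv = 1 / math.sqrt(2) if v == 0 else 1.0
    return (math.sqrt(2 / R) * cu * math.cos(math.pi * (2 * r + 1) * u / (2 * R))
            * math.sqrt(2 / C) * cv * math.cos(math.pi * (2 * c + 1) * v / (2 * C)))


def pixel(F, r, c):
    """One output value: element product of F with the basis signal (r, c)."""
    R, C = len(F), len(F[0])
    return sum(F[u][v] * kernel(u, v, r, c, R, C)
               for u in range(R) for v in range(C))


def idct_parallel(F, workers=4):
    """Compute every pixel as an independent task (external parallelism)."""
    R, C = len(F), len(F[0])
    coords = [(r, c) for r in range(R) for c in range(C)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        values = list(pool.map(lambda rc: pixel(F, rc[0], rc[1]), coords))
    return [values[r * C:(r + 1) * C] for r in range(R)]


# Only the DC coefficient is non-zero, so every pixel has the same value.
F = [[4.0, 0.0, 0.0, 0.0]] + [[0.0] * 4 for _ in range(3)]
print(idct_parallel(F))   # every entry is close to 1.0
```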

VI. Algorithm Performance Analysis

To assess the performance of the algorithm, the Sun algorithm, the CSA, the Vetterli algorithm [8], the Feig algorithm [9] and the 2-D parallel computing (2DPC) method proposed in this paper are compared. The equivalent calculating amount (ECA) of the five algorithms is listed in Table I.

Table I The Equivalent Calculation Comparison of 8×8 DCT

Method   CSA           Vetterli      Feig          Sun         2DPC
         ×      +      ×      +      ×      +      ×     +     ×     +
ECA      1024   896    208    524    60     462    1     6     1     6



Table I shows that the equivalent numbers of multiplications and additions needed by the proposed algorithm to complete the 2-D DCT are significantly reduced. The equivalent multiplication count of the proposed algorithm is only 0.098%, 0.48% and 1.67% of that of the CSA, the Vetterli algorithm and the Feig algorithm respectively, and the equivalent addition count is only 0.67%, 1.15% and 1.30%. The equivalent calculating amount of the proposed algorithm is the same as that of the Sun algorithm. These results demonstrate the superior performance of the proposed algorithm.
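The quoted percentages can be reproduced from the Table I figures with a few lines of arithmetic. The dictionary layout and the function name `percent_of` are illustrative choices; the numbers themselves are the ECA values listed in Table I.

```python
# (multiplications, additions) from Table I for the 8 x 8 case
eca = {"CSA": (1024, 896), "Vetterli": (208, 524), "Feig": (60, 462)}
proposed = (1, 6)


def percent_of(part, whole):
    """`part` expressed as a percentage of `whole`."""
    return 100.0 * part / whole


for name, (mults, adds) in eca.items():
    print(name,
          round(percent_of(proposed[0], mults), 3),
          round(percent_of(proposed[1], adds), 2))
```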

From the computation for the 8×8 signal, we can see that the DCT of an N×N signal is equivalent in time to 1 multiplication and a few additions, and that the time required for the additions is determined by the number of superposition steps. For example, the number of superposition steps is 2, 4, 4, 5, 6, 6 and 6 for 2-D signals of size 2×2, 3×3, 4×4, 5×5, 6×6, 7×7 and 8×8 respectively. In general, for an N×N signal, the number of superposition steps is the exponent of the smallest power of 2 that is not less than N×N.

For convenience of calculation and expression, an operator, Δ, is defined by the following rule.

$$
a\,\Delta\,b = c
\tag{21}
$$

Where c is the exponent of the smallest power of b that is not less than a.

Taking 7Δ2 as an example, the rule of equation (21) works as follows. First, find the powers of 2 that are not less than 7, namely $2^3$, $2^4$, $2^5$, etc. Second, take the smallest of these, $2^3$. Finally, take its exponent, 3. Likewise, 8Δ2 is equal to 3, 9Δ3 is equal to 2, and so on. An N×N signal therefore needs (N×N)Δ2 steps of parallel addition, and the actual time consumed is equivalent to the time of (N×N)Δ2 additions.
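The operator of equation (21) amounts to a rounded-up logarithm, and it can be sketched with integer arithmetic only, which avoids floating-point rounding at exact powers. The function name `min_power_exponent` is a hypothetical name for this illustration, and the sketch assumes a ≥ 1 and b ≥ 2.

```python
def min_power_exponent(a, b):
    """Exponent c of the smallest power of b that is not less than a."""
    c, power = 0, 1
    while power < a:
        power *= b
        c += 1
    return c


print(min_power_exponent(7, 2))   # 3
print(min_power_exponent(8, 2))   # 3
print(min_power_exponent(9, 3))   # 2

# Parallel addition steps for N x N signals, N = 2 .. 8
print([min_power_exponent(n * n, 2) for n in range(2, 9)])   # [2, 4, 4, 5, 6, 6, 6]
```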

In addition, the multiplication time does not depend on the size of the 2-D signal and is always equivalent to 1 multiplication, whereas the addition time depends closely on the size. According to the calculation rule of equation (21), the parallel computation of the 2-D DCT reaches its optimum efficiency when N is an integer power of 2.

VII. Conclusion 

To address the drawbacks of serial computation of the 2-D DCT, this paper departs from the conventional row-column approach and proposes a new algorithm, the 2-D SDCT. Following the operation rule of the 2-D DCT, a new transform matrix and several new matrix operations are defined to facilitate parallel computing. Based on the transform basic matrix, the 2-D DCT can be carried out neatly in parallel. The 2-D DCT computed by the proposed 2-D SDCT algorithm needs less computation time, and its efficiency is notably improved relative to serial computation. In addition, the parallel computation of the 2-D DCT reaches its optimum efficiency when the size of the 2-D signal is an integer power of 2.